@@ -21,19 +21,19 @@ module Agents
 
 To tell the Agent how to parse the content, specify `extract` as a hash with keys naming the extractions and values of hashes.
 
-When parsing HTML or XML, these sub-hashes specify how to extract with either a `css` CSS selector or a `xpath` XPath expression and either `'text': true` or `attr` pointing to an attribute name to grab. An example:
+When parsing HTML or XML, these sub-hashes specify how to extract with either a `css` CSS selector or a `xpath` XPath expression and either `"text": true` or `attr` pointing to an attribute name to grab. An example:
 
-    'extract': {
-      'url': { 'css': "#comic img", 'attr': "src" },
-      'title': { 'css': "#comic img", 'attr': "title" },
-      'body_text': { 'css': "div.main", 'text': true }
+    "extract": {
+      "url": { "css": "#comic img", "attr": "src" },
+      "title": { "css": "#comic img", "attr": "title" },
+      "body_text": { "css": "div.main", "text": true }
     }
 
 When parsing JSON, these sub-hashes specify [JSONPaths](http://goessner.net/articles/JsonPath/) to the values that you care about. For example:
 
-    'extract': {
-      'title': { 'path': "results.data[*].title" },
-      'description': { 'path': "results.data[*].description" }
+    "extract": {
+      "title": { "path": "results.data[*].title" },
+      "description": { "path": "results.data[*].description" }
     }
 
 Note that for all of the formats, whatever you extract MUST have the same number of matches for each extractor. E.g., if you're extracting rows, all extractors must match all rows. For generating CSS selectors, something like [SelectorGadget](http://selectorgadget.com) may be helpful.
@@ -155,7 +155,7 @@ module Agents
 when xpath = extraction_details['xpath']
   nodes = doc.xpath(xpath)
 else
-  error "'css' or 'xpath' is required for HTML or XML extraction"
+  error '"css" or "xpath" is required for HTML or XML extraction'
   return
 end
 unless Nokogiri::XML::NodeSet === nodes
@@ -168,7 +168,7 @@ module Agents
   elsif extraction_details['text']
     node.text()
   else
-    error "'attr' or 'text' is required on HTML or XML extraction patterns"
+    error '"attr" or "text" is required on HTML or XML extraction patterns'
     return
   end
 }
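
The description text in the first hunk says that every extractor must produce the same number of matches. As a rough illustration of why that matters, here is a minimal sketch in plain Ruby. The helper name `combine_extractions` and its input shape (extractor name mapped to an array of matched values) are assumptions made for this example and are not taken from the diff; the agent's real implementation may differ.

```ruby
# Hypothetical helper, not part of the diff above: turns per-extractor match
# lists into one hash per row, and refuses uneven match counts.
def combine_extractions(matches)
  counts = matches.values.map(&:length).uniq
  if counts.length > 1
    # Report how many matches each extractor produced to ease debugging.
    raise ArgumentError,
          "uneven number of matches: #{matches.transform_values(&:length)}"
  end

  row_count = counts.first || 0
  # Build row N from the Nth match of every extractor.
  Array.new(row_count) do |index|
    matches.transform_values { |values| values[index] }
  end
end

sample_matches = {
  "url"   => ["/a.png", "/b.png"],
  "title" => ["First", "Second"]
}
events = combine_extractions(sample_matches)
```

With equal-length lists, each row pairs the Nth `url` with the Nth `title`. If one extractor matched fewer items, there would be no reliable way to tell which values belong together, so the sketch raises instead of guessing.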